
perf(l1): reduce BAL parallel-path overhead#6639

Closed
edg-l wants to merge 4 commits into main from perf/bal-parallel-overhead-rebased


Conversation

@edg-l
Contributor

@edg-l edg-l commented May 13, 2026

Summary

Rebased onto current main (post bal-devnet-7-pr squash merge). Scope reduced from the original bundle: dropped the CachingDatabase RwLock<HashMap> → DashMap swap and the 500ms → 100ms import-bench sleep tweak. Both require coordinated zkVM-feature-gating work that isn't done here.

Remaining changes target the BAL parallel-execution path (execute_block_parallel + handle_merkleization_bal + seed_db_from_bal). All changes are within already rayon-gated code paths -- zkVM builds (sp1/risc0/zisk/eip-8025) are unaffected.

What is in this PR

  • A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs): for updates in rx { ... } blocked until channel close (= exec end). execute_block_parallel sends exactly one batch up front from bal_to_account_updates, so the drain loop received nothing further and only serialized Stage B (parallel storage roots) after exec instead of letting it overlap. Replaced with a single rx.recv() and dropped the FxHashMap merge step (BAL guarantees one entry per address).
  • Q1. Skip prestate read in bal_to_account_updates when BAL covers all info fields (crates/vm/backends/levm/mod.rs): two fast paths added -- storage-only updates (info: None, removed: false by construction); full info coverage with non-empty post (removal impossible, info from BAL alone). Slow path keeps existing behavior for partial coverage.
  • Q2. Per-tx GeneralizedDatabase capacity cap at 32 (execute_block_parallel): previously sized to bal.accounts().len() (often 100s on stress blocks); p50 tx touches <10 accounts. Reduced allocator pressure across rayon workers.
  • Q3. Memoize code_from_bal results across seed_db_from_bal calls: pre-compute Code objects (hash + jump_targets) once per BAL code change before the par_iter; pass cache via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target scans per code change per block (N = tx count).
  • Q8. Move per-tx BAL validation into the rayon par_iter closure (execute_block_parallel): eliminates a serial post-exec validation pass (~3 ms median across 200 txs). Drops current_state and codes inside the closure after validation runs -- they no longer cross the rayon boundary, reducing per-tx allocator pressure. Closure returns deferred Option<EvmError> so gas-limit check still takes priority over BAL mismatch errors.
  • Coinbase exemption fix: gated the post-exec unaccessed_pure_accounts coinbase removal on !exec_results.is_empty() to fix 0-tx-block regression (Q8 had hoisted the removal outside the per-tx loop, silently exempting coinbase on empty/withdrawal-only blocks). Found via EELS test_bal_invalid_extraneous_coinbase[empty_block|withdrawal_only].
  • Pre-size backup_storage_slot inner map (separate commit): avoids rehash chain growth in the slot backup tracker.
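
To make the overlap fix in item A concrete, here is a minimal sketch using only `std::sync::mpsc`. The names `spawn_executor`, `drain_until_close`, and `recv_once` are hypothetical stand-ins, not ethrex functions; the "executor" simply sends one batch and then keeps its sender alive for a while, which is the situation the description attributes to execute_block_parallel.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the executor: sends one batch up front,
/// then keeps the sender alive while "execution" runs for `exec_time`.
fn spawn_executor(exec_time: Duration) -> (mpsc::Receiver<Vec<u64>>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        tx.send(vec![1, 2, 3]).expect("receiver alive");
        thread::sleep(exec_time); // the sender is only dropped once "exec" ends
    });
    (rx, handle)
}

/// Old shape: iterating the receiver only finishes when the channel closes.
fn drain_until_close(exec_time: Duration) -> (usize, Duration) {
    let (rx, handle) = spawn_executor(exec_time);
    let start = Instant::now();
    let mut batches = 0;
    for _updates in rx {
        batches += 1;
    }
    let waited = start.elapsed();
    handle.join().unwrap();
    (batches, waited)
}

/// New shape: a single recv() returns as soon as the one batch arrives,
/// so downstream work can start while the sender is still busy.
fn recv_once(exec_time: Duration) -> (usize, Duration) {
    let (rx, handle) = spawn_executor(exec_time);
    let start = Instant::now();
    let batches = match rx.recv() {
        Ok(_updates) => 1,
        Err(_) => 0,
    };
    let waited = start.elapsed();
    handle.join().unwrap();
    (batches, waited)
}
```

Both functions see exactly one batch; the difference is only how long the receiver waits before it can move on, which is the time Stage B would otherwise lose.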

What was dropped from the original bundle (deferred to follow-ups)

  • DashMap. CachingDatabase RwLock → DashMap: the Cargo.toml rayon gating from fix(l1): re-enable rayon feature on blockchain crate #6662 makes the levm crate's rayon dep optional (zkVM builds skip rayon entirely). The DashMap swap rewrites struct fields and methods unconditionally; to land cleanly it needs paired zkVM-fallback variants matching the existing #[cfg(all(feature = "rayon", not(feature = "eip-8025")))] pattern. Split out to its own PR.
  • C. import-bench 500ms → 100ms sleep: superseded by feat(l1): improve rlp import #6666 which replaces the sleep with Store::wait_for_persistence_idle() entirely.

Headline numbers (from the original bundle, before reduced scope)

460-block mainnet-mix fixture from bal-devnet-7 kurtosis localnet (~670 tx/s spamoor mix, 65 Mgas median, ~200 tx/block, 452/460 blocks carry BAL):

| Metric | Value |
|---|---|
| Wall time | 5.06 s |
| Per-block median | ~8 ms |
| Per-block Ggas/s median | 7.67 |
| Per-block Ggas/s p90 | 8.32 |
| Per-block Ggas/s max | 13.49 |
| Warmer/exec overlap | 99 % |

Numbers above include the DashMap change; expect a regression of a few % once that's removed (perf record showed 11% CPU on the RwLock contention -- DashMap removed that; without it some of that comes back, exact figure pending re-bench).

Test plan

  • cargo check -p ethrex (default features) -- clean
  • cargo check -p ethrex-levm --no-default-features --features sp1 (zkVM mode, no rayon) -- clean
  • cargo fmt --check -- clean
  • CI: make test, EF tests, Hive Amsterdam
  • Re-bench on 460-block fixture without DashMap to quantify regression

@github-actions

github-actions Bot commented May 13, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known Issues
section the L1 workflow appends to each ef-tests job summary and posts as a
sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened test-stateless to extract the entire for_amsterdam/ tree
from the zkevm bundle and run all of it under --features stateless; combined
with this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch / BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

@github-actions github-actions Bot added the L1 (Ethereum client) and performance (Block execution throughput and performance in general) labels May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Lines of code report

Total lines added: 67
Total lines removed: 5
Total lines changed: 72

Detailed view
+----------------------------------------+-------+------+
| File                                   | Lines | Diff |
+----------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs | 2541  | -5   |
+----------------------------------------+-------+------+
| ethrex/crates/vm/backends/levm/mod.rs  | 2515  | +67  |
+----------------------------------------+-------+------+

@github-actions

github-actions Bot commented May 13, 2026

Benchmark Results Comparison

No significant difference was registered for any benchmark run.

Detailed Results

Benchmark Results: BubbleSort

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_BubbleSort | 2.969 ± 0.022 | 2.944 | 3.013 | 1.09 ± 0.01 |
| main_levm_BubbleSort | 2.723 ± 0.015 | 2.707 | 2.743 | 1.00 |
| pr_revm_BubbleSort | 3.202 ± 0.655 | 2.974 | 5.066 | 1.18 ± 0.24 |
| pr_levm_BubbleSort | 2.733 ± 0.020 | 2.701 | 2.759 | 1.00 ± 0.01 |

Benchmark Results: ERC20Approval

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Approval | 1.006 ± 0.074 | 0.970 | 1.215 | 1.02 ± 0.08 |
| main_levm_ERC20Approval | 1.051 ± 0.015 | 1.039 | 1.090 | 1.07 ± 0.02 |
| pr_revm_ERC20Approval | 0.984 ± 0.012 | 0.968 | 1.006 | 1.00 |
| pr_levm_ERC20Approval | 1.050 ± 0.008 | 1.040 | 1.068 | 1.07 ± 0.02 |

Benchmark Results: ERC20Mint

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Mint | 133.7 ± 5.8 | 130.2 | 149.8 | 1.01 ± 0.05 |
| main_levm_ERC20Mint | 161.0 ± 0.5 | 160.3 | 161.8 | 1.21 ± 0.02 |
| pr_revm_ERC20Mint | 132.5 ± 1.9 | 131.1 | 137.6 | 1.00 |
| pr_levm_ERC20Mint | 160.0 ± 1.3 | 158.8 | 162.6 | 1.21 ± 0.02 |

Benchmark Results: ERC20Transfer

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ERC20Transfer | 231.6 ± 0.7 | 230.3 | 232.8 | 1.00 ± 0.01 |
| main_levm_ERC20Transfer | 263.8 ± 2.5 | 261.0 | 268.6 | 1.14 ± 0.01 |
| pr_revm_ERC20Transfer | 231.2 ± 1.8 | 229.6 | 234.7 | 1.00 |
| pr_levm_ERC20Transfer | 263.7 ± 5.0 | 259.9 | 277.2 | 1.14 ± 0.02 |

Benchmark Results: Factorial

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Factorial | 225.3 ± 1.1 | 223.6 | 227.5 | 1.00 ± 0.01 |
| main_levm_Factorial | 245.2 ± 3.2 | 243.4 | 254.2 | 1.09 ± 0.01 |
| pr_revm_Factorial | 225.2 ± 0.4 | 224.3 | 225.7 | 1.00 |
| pr_levm_Factorial | 242.5 ± 2.4 | 240.4 | 248.6 | 1.08 ± 0.01 |

Benchmark Results: FactorialRecursive

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| main_revm_FactorialRecursive | 1.620 ± 0.022 | 1.591 | 1.647 | 1.01 ± 0.05 |
| main_levm_FactorialRecursive | 8.631 ± 0.043 | 8.533 | 8.684 | 5.37 ± 0.23 |
| pr_revm_FactorialRecursive | 1.607 ± 0.068 | 1.479 | 1.680 | 1.00 |
| pr_levm_FactorialRecursive | 8.645 ± 0.041 | 8.579 | 8.712 | 5.38 ± 0.23 |

Benchmark Results: Fibonacci

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Fibonacci | 201.8 ± 1.5 | 198.6 | 203.6 | 1.00 ± 0.02 |
| main_levm_Fibonacci | 225.1 ± 18.6 | 217.1 | 277.4 | 1.12 ± 0.10 |
| pr_revm_Fibonacci | 201.6 ± 4.5 | 199.4 | 214.1 | 1.00 |
| pr_levm_Fibonacci | 219.1 ± 3.6 | 215.6 | 226.4 | 1.09 ± 0.03 |

Benchmark Results: FibonacciRecursive

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_FibonacciRecursive | 846.7 ± 9.3 | 825.5 | 857.5 | 1.26 ± 0.02 |
| main_levm_FibonacciRecursive | 671.8 ± 7.0 | 662.3 | 684.0 | 1.00 |
| pr_revm_FibonacciRecursive | 855.0 ± 10.2 | 838.9 | 870.6 | 1.27 ± 0.02 |
| pr_levm_FibonacciRecursive | 680.6 ± 11.7 | 671.5 | 711.2 | 1.01 ± 0.02 |

Benchmark Results: ManyHashes

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_ManyHashes | 8.4 ± 0.2 | 8.2 | 8.7 | 1.01 ± 0.02 |
| main_levm_ManyHashes | 9.7 ± 0.1 | 9.6 | 9.8 | 1.17 ± 0.01 |
| pr_revm_ManyHashes | 8.3 ± 0.1 | 8.2 | 8.4 | 1.00 |
| pr_levm_ManyHashes | 9.7 ± 0.2 | 9.5 | 10.0 | 1.16 ± 0.02 |

Benchmark Results: MstoreBench

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_MstoreBench | 257.2 ± 2.3 | 254.9 | 262.3 | 1.14 ± 0.01 |
| main_levm_MstoreBench | 225.0 ± 0.7 | 223.9 | 225.9 | 1.00 |
| pr_revm_MstoreBench | 261.5 ± 8.3 | 254.0 | 274.3 | 1.16 ± 0.04 |
| pr_levm_MstoreBench | 225.0 ± 1.3 | 222.9 | 226.7 | 1.00 ± 0.01 |

Benchmark Results: Push

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_Push | 293.2 ± 1.1 | 291.7 | 295.3 | 1.08 ± 0.01 |
| main_levm_Push | 272.4 ± 1.3 | 270.4 | 275.2 | 1.00 ± 0.01 |
| pr_revm_Push | 294.1 ± 1.7 | 291.8 | 297.3 | 1.08 ± 0.01 |
| pr_levm_Push | 271.6 ± 1.4 | 269.8 | 274.4 | 1.00 |

Benchmark Results: SstoreBench_no_opt

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| main_revm_SstoreBench_no_opt | 165.5 ± 2.1 | 162.0 | 170.0 | 1.65 ± 0.03 |
| main_levm_SstoreBench_no_opt | 101.2 ± 2.6 | 99.3 | 108.6 | 1.01 ± 0.03 |
| pr_revm_SstoreBench_no_opt | 164.7 ± 1.8 | 161.8 | 166.8 | 1.64 ± 0.02 |
| pr_levm_SstoreBench_no_opt | 100.6 ± 0.9 | 99.5 | 102.3 | 1.00 |

@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from 75c47b2 to e792377 Compare May 13, 2026 09:36
@github-actions

github-actions Bot commented May 13, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 103.405 ± 0.545 | 102.733 | 104.366 | 1.00 ± 0.01 |
| head | 103.200 ± 0.167 | 103.029 | 103.547 | 1.00 |

@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from e792377 to 161f583 Compare May 13, 2026 11:00
Base automatically changed from bal-devnet-6-pr to main May 13, 2026 13:07
@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from 161f583 to 7e081a6 Compare May 14, 2026 08:05
@edg-l edg-l changed the base branch from main to bal-devnet-7-pr May 14, 2026 08:06
@edg-l edg-l marked this pull request as ready for review May 15, 2026 14:52
@edg-l edg-l requested review from a team, ManuelBilbao and ilitteri as code owners May 15, 2026 14:52
@ethrex-project-sync ethrex-project-sync Bot moved this to In Review in ethrex_l1 May 15, 2026
@github-actions

🤖 Claude Code Review


PR Review: perf(l1): reduce BAL parallel-path overhead

Overall Assessment

This is a well-engineered performance bundle with clear, documented reasoning for each optimization. The changes show strong understanding of the BAL parallel execution model. Test coverage is solid (EF tests, Hive, mainnet-mix fixture). A few correctness subtleties warrant attention.


Correctness Concerns

blockchain.rs: Single-recv assumption is fragile (medium severity)

// Line 82
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => { ... updates }
    Err(_) => Vec::new()
};

The change replaces a draining loop with a single recv(), relying on the invariant that execute_block_parallel sends exactly one batch. This is documented in the comment, and verified to be true today (mod.rs:1086-1089). However:

  • The contract lives entirely in code comments. If a future refactor sends a second batch (e.g. for incremental streaming), the second batch is silently dropped — no panic, no error, just a truncated state root.
  • Consider adding a debug assertion: debug_assert!(rx.try_recv().is_err(), "merkleizer received unexpected second batch") after the recv(). This costs nothing in release builds but would catch protocol drift in CI.
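
A stricter variant of the suggested guard can be sketched with `try_recv`, which separates "nothing queued" from "a second batch is already waiting". This is a minimal sketch, not the PR's code: `recv_single_batch` and `BatchError` are hypothetical names, and the batch type is simplified to `Vec<u32>`. It returns an error instead of asserting, so it is also active in release builds.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

#[derive(Debug, PartialEq)]
enum BatchError {
    /// The sender hung up before sending anything.
    NoBatch,
    /// A second batch was already queued when the first was consumed.
    UnexpectedSecondBatch,
}

/// Receives exactly one batch and rejects a second one that is already queued.
fn recv_single_batch(rx: &Receiver<Vec<u32>>) -> Result<Vec<u32>, BatchError> {
    let first = rx.recv().map_err(|_| BatchError::NoBatch)?;
    match rx.try_recv() {
        Ok(_second) => Err(BatchError::UnexpectedSecondBatch),
        // Empty: sender still alive but idle. Disconnected: sender finished.
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => Ok(first),
    }
}
```

Note the limit of any check of this shape, including the `debug_assert!` form: it only sees a second batch that has already been sent when the check runs. A batch sent later would still go unnoticed.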

bal_to_account_updates: Fast path 1 removed: false assumption

// ~line 740 (mod.rs)
if !has_info_changes {
    updates.push(AccountUpdate { ..., removed: false, info: None, ... });
    continue;
}

The logic is correct: without info changes, balance/nonce/code_hash are identical in pre- and post-state, so the account cannot transition to empty (removed = post_empty && !pre_empty is always false). The reasoning holds. But the comment ("can't be a removal … by definition") glosses over the "pre-state might already be empty" edge case — it's actually fine because if pre was empty AND storage changes arrived, the pre-empty invariant was already false. Worth noting for future readers.
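
The removal rule quoted above (removed = post_empty && !pre_empty) can be checked against both fast paths with a small model. This is a sketch under stated assumptions: `AccountInfo` is a simplified stand-in (a `has_code` flag instead of a code hash), and the three function names are hypothetical, not taken from mod.rs.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct AccountInfo {
    nonce: u64,
    balance: u128,
    has_code: bool,
}

impl AccountInfo {
    /// Empty in the usual sense: zero nonce, zero balance, no code.
    fn is_empty(&self) -> bool {
        self.nonce == 0 && self.balance == 0 && !self.has_code
    }
}

/// Slow path: needs the pre-state read.
fn removed_slow_path(pre: AccountInfo, post: AccountInfo) -> bool {
    post.is_empty() && !pre.is_empty()
}

/// Fast path 1 (storage-only): info is unchanged, so post == pre and
/// `post_empty && !pre_empty` can never hold.
fn removed_storage_only() -> bool {
    false
}

/// Fast path 2 (full info coverage): a non-empty post-state rules out removal
/// without reading pre-state. `None` means "fall back to the slow path".
fn removed_full_coverage(post: AccountInfo) -> Option<bool> {
    if post.is_empty() {
        None
    } else {
        Some(false)
    }
}
```

For any pre-state, `removed_slow_path(pre, pre)` is false, which is exactly what fast path 1 hard-codes, including the case where the pre-state was already empty.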

unread_storage_reads cleanup: semantic equivalence confirmed

The refactoring from per-tx current_state iteration to pre-computed destroyed_addresses/read_keys vecs is semantically identical to the serial loop. The retain(|&(a, _)| a != *addr) call for destroyed accounts is O(n) per destroyed account, same as before — acceptable given destroyed accounts are rare.

Validation error ordering change

In the old serial loop, BAL validation for tx N short-circuited before tx N+1 ran. Now all tx validations execute in parallel, and only the first error (by tx index, since exec_results is sorted) is surfaced after the gas check. This is a behavior-preserving change for the common case, but if validation error messages ever become important for debugging, the silent discard of later errors could be surprising. Consider logging suppressed errors at warn! level.
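
The ordering described here (gas check first, then the lowest-index deferred error) can be sketched as a small post-processing step. Everything below is an illustrative assumption: `TxResult`, `BlockError`, and `finalize` are hypothetical names, and the gas check is reduced to a simple sum against a limit.

```rust
#[derive(Debug, PartialEq)]
enum BlockError {
    GasLimitExceeded,
    BalMismatch { tx_index: usize },
}

/// What a parallel worker hands back: the validation outcome is carried
/// along as data instead of being raised immediately.
struct TxResult {
    tx_index: usize,
    gas_used: u64,
    bal_mismatch: bool,
}

/// Runs after all workers finish. The gas check takes priority; only then
/// is the first deferred validation error (by tx index) surfaced.
fn finalize(mut results: Vec<TxResult>, gas_limit: u64) -> Result<u64, BlockError> {
    results.sort_by_key(|r| r.tx_index);
    let total_gas: u64 = results.iter().map(|r| r.gas_used).sum();
    if total_gas > gas_limit {
        return Err(BlockError::GasLimitExceeded);
    }
    if let Some(first_bad) = results.iter().find(|r| r.bal_mismatch) {
        return Err(BlockError::BalMismatch { tx_index: first_bad.tx_index });
    }
    Ok(total_gas)
}
```

Later mismatches are dropped by the `find`; if they matter for debugging, that is the place where they could be collected or logged.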


Correctness Fix (Positive)

Coinbase exemption for 0-tx blocks (mod.rs:1395):

if !unaccessed_pure_accounts.is_empty() && !exec_results.is_empty() {
    unaccessed_pure_accounts.remove(&header.coinbase);
}

The added && !exec_results.is_empty() guard is a genuine correctness improvement. Empty blocks / withdrawal-only blocks never trigger fee finalization, so a BAL entry for the coinbase in such blocks is legitimately extraneous and should fail validation. This matches the EELS test (test_bal_invalid_extraneous_coinbase) and is correct.
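
A reduced model of the gate shows the behavioural difference between a 0-tx block and a block with transactions. This is a sketch only: `Address` is simplified to a byte array, and `remaining_unaccessed` is a hypothetical helper that takes a tx count in place of `exec_results`.

```rust
use std::collections::HashSet;

type Address = [u8; 20];

/// Returns the BAL accounts that are still considered extraneous after the
/// coinbase exemption has (or has not) been applied.
fn remaining_unaccessed(
    mut unaccessed_pure_accounts: HashSet<Address>,
    coinbase: Address,
    executed_txs: usize,
) -> HashSet<Address> {
    // Exempt the coinbase only when at least one tx ran, since fee
    // finalization is what touches it.
    if !unaccessed_pure_accounts.is_empty() && executed_txs > 0 {
        unaccessed_pure_accounts.remove(&coinbase);
    }
    unaccessed_pure_accounts
}
```

With `executed_txs == 0` the coinbase entry stays in the set and would be reported as extraneous; with one or more txs it is exempted.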


Design / Structural Notes

wait_for_persistence_idle: Rendezvous correctness verified

The Ping-based idle detection is elegant. Confirmed that trie_upd_tx/trie_upd_rx are created with sync_channel(0) (store.rs:1557), so the rendezvous semantics hold: a successful send proves the worker is back at recv(). The only gap is that this invariant isn't enforced at the type level — if someone changes sync_channel(0) to sync_channel(1), Pings could buffer and the method would return a false positive. A comment at the channel creation site pointing to wait_for_persistence_idle would help protect against this:

// INVARIANT: capacity must stay 0 (rendezvous) — wait_for_persistence_idle
// depends on send-blocking to detect worker idle. See Store::wait_for_persistence_idle.
let (trie_upd_tx, trie_upd_rx) = std::sync::mpsc::sync_channel(0);
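
The rendezvous argument can be tried out in isolation. The sketch below is not the Store implementation: `Msg`, `Worker`, `spawn_worker`, `submit`, and `wait_idle` are hypothetical names, and the "work" is just a sleep. It relies only on the documented behaviour of `sync_channel(0)`, where a send does not return until a receive is paired with it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

enum Msg {
    Work(Duration),
    Ping,
}

struct Worker {
    tx: SyncSender<Msg>,
    completed: Arc<AtomicUsize>,
}

fn spawn_worker() -> Worker {
    // Capacity 0: every send blocks until the worker is sitting in recv().
    let (tx, rx) = sync_channel::<Msg>(0);
    let completed = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&completed);
    thread::spawn(move || {
        while let Ok(msg) = rx.recv() {
            match msg {
                Msg::Work(duration) => {
                    thread::sleep(duration);
                    counter.fetch_add(1, Ordering::SeqCst);
                }
                Msg::Ping => {} // nothing to do, just proves we reached recv()
            }
        }
    });
    Worker { tx, completed }
}

impl Worker {
    fn submit(&self, duration: Duration) {
        self.tx.send(Msg::Work(duration)).expect("worker alive");
    }

    /// Returns only once the worker has finished everything sent before.
    fn wait_idle(&self) {
        self.tx.send(Msg::Ping).expect("worker alive");
    }

    fn completed(&self) -> usize {
        self.completed.load(Ordering::SeqCst)
    }
}
```

With a capacity of 1 the Ping could be buffered while the worker is still busy, and `wait_idle` would return early, which is the false positive the review warns about.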

code_cache index binding (mod.rs:1138–1147 and ~947–967)

code_cache[acct_idx][code_idx] assumes code_cache is built from bal.accounts() in the same order as accounts_by_min_index resolves its .1 field. Both are derived from the same bal.accounts() slice, so this holds today. It's an implicit invariant — a brief comment at the cache[acct_idx] access site noting "acct_idx is an index into bal.accounts()" would make the coupling explicit.
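
The index coupling is easier to see in a reduced model. In this sketch `BalAccount`, `Code`, `analyze`, `build_code_cache`, and `cached_code` are hypothetical stand-ins; the hash is `DefaultHasher` rather than keccak (an assumption made to stay within the standard library), and the jump-target scan is the usual JUMPDEST walk that skips PUSH immediates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, PartialEq)]
struct Code {
    hash: u64, // stand-in for the keccak code hash
    jump_targets: Vec<usize>,
}

struct BalAccount {
    code_changes: Vec<Vec<u8>>,
}

/// The expensive per-code work: one hash plus one jump-target scan.
fn analyze(bytecode: &[u8]) -> Code {
    let mut hasher = DefaultHasher::new();
    bytecode.hash(&mut hasher);
    let mut jump_targets = Vec::new();
    let mut pc = 0;
    while pc < bytecode.len() {
        let op = bytecode[pc];
        if op == 0x5b {
            jump_targets.push(pc); // JUMPDEST
        }
        if (0x60..=0x7f).contains(&op) {
            pc += (op - 0x5f) as usize; // skip PUSH1..PUSH32 immediates
        }
        pc += 1;
    }
    Code { hash: hasher.finish(), jump_targets }
}

/// Built once per block, in the same order as `accounts`, so that
/// cache[acct_idx][code_idx] lines up with accounts[acct_idx].code_changes[code_idx].
fn build_code_cache(accounts: &[BalAccount]) -> Vec<Vec<Code>> {
    accounts
        .iter()
        .map(|account| account.code_changes.iter().map(|code| analyze(code)).collect())
        .collect()
}

/// `acct_idx` is an index into the same account slice the cache was built from.
fn cached_code(cache: &[Vec<Code>], acct_idx: usize, code_idx: usize) -> &Code {
    &cache[acct_idx][code_idx]
}
```

Every per-tx seeding step can then borrow from the cache instead of re-running `analyze`, which is where the N-1 saved scans per code change come from.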

BalAccountCodeCache type name

type BalAccountCodeCache = Vec<(H256, Option<Code>)>;

The name is slightly misleading: it's not a cache for the whole BAL, it's a per-account cache of code-change entries. PerAccountCodeChanges or BalAccountCodeChanges would be more descriptive, though this is minor.


Performance Observations

DashMap migration is well-motivated. The switch from RwLock<FxHashMap> removes the write-lock bottleneck under rayon parallelism. Using FxBuildHasher preserves the fast hash for short fixed-size keys (Address, H256). The TOCTOU double-fetch pattern on cache miss is retained from the old design and is acceptable (idempotent reads from backing store).
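
For readers unfamiliar with why sharding removes the contention, the idea can be sketched with the standard library alone. This is not DashMap's implementation and not the PR's code: `ShardedCache` and `get_or_load` are hypothetical, and `DefaultHasher` stands in for the Fx hasher. It also reproduces the double-fetch-on-miss pattern mentioned above: two threads may both call `load`, and the first insert wins.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// One lock per shard instead of one lock for the whole map, so threads
/// touching different keys rarely wait on each other.
struct ShardedCache<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedCache<K, V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0);
        Self {
            shards: (0..shard_count).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        &self.shards[(hasher.finish() as usize) % self.shards.len()]
    }

    /// Read-lock fast path; on a miss, load outside any lock and insert.
    /// `load` may run more than once under races, so it must be idempotent.
    fn get_or_load(&self, key: K, load: impl FnOnce() -> V) -> V {
        let shard = self.shard_for(&key);
        let cached = shard.read().unwrap().get(&key).cloned();
        if let Some(value) = cached {
            return value;
        }
        let value = load();
        let mut guard = shard.write().unwrap();
        guard.entry(key).or_insert(value).clone()
    }
}
```

A single `RwLock<HashMap>` corresponds to `shard_count == 1`, where every reader and writer meets on the same lock.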

PER_TX_DB_CAPACITY = 32 is a reasonable constant. The comment notes "most txs touch <10 accounts". Assuming the constant is passed to with_capacity, the map is guaranteed to hold at least 32 entries before its first resize (hashbrown rounds the bucket count up to respect its 7/8 load factor). For typical txs this should be sufficient.

or_insert_with(|| FxHashMap::with_capacity_and_hasher(8, ...)) in gen_db.rs: The pre-sizing comment is well-supported by the flamegraph reference. A capacity of 8 means one initial alloc handles the common case of 5–8 SSTOREs per account per call frame cleanly.
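
The effect of pre-sizing can be observed directly by counting capacity changes during inserts. This is a small sketch with the standard `HashMap` and its default hasher rather than `FxHashMap` (an assumption to keep it dependency-free); `count_growths` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Inserts `inserts` distinct keys and counts how often the map's
/// capacity changes, i.e. how many times it had to grow.
fn count_growths(mut map: HashMap<u64, u64>, inserts: u64) -> usize {
    let mut growths = 0;
    let mut last_capacity = map.capacity();
    for key in 0..inserts {
        map.insert(key, key);
        if map.capacity() != last_capacity {
            growths += 1;
            last_capacity = map.capacity();
        }
    }
    growths
}
```

A map created with `HashMap::with_capacity(8)` takes eight inserts without growing, while one created with `HashMap::new()` starts at capacity 0 and has to grow at least once. The exact number of growth steps is an implementation detail of the hash table.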


Minor Nits

  • prefetch_accounts and prefetch_storage now return the try_for_each result directly (no trailing Ok(())). This is idiomatic Rust and correct, but is a style break from the rest of the file which uses explicit Ok(()). Consistent either way.
  • The expect("checked by has_full_info_coverage") calls in fast path 2 are safe and the message is useful. Prefer keeping them over unwrap() for diagnosability.
  • drop(current_state); drop(codes); on lines 1322-1323 are explicit drops that the comment justifies (to avoid crossing the rayon boundary). The drops are correct but redundant since both are consumed by the validation closure before the Ok((...)) return. Rust would drop them at end-of-scope anyway, but the explicit drops reinforce the intent.

Summary

  • Merkleization overlap fix: correct, but the single-recv contract should be guarded with a debug assertion.
  • Fast paths in bal_to_account_updates: logically sound; the storage-only fast path corner case around pre-empty accounts is correctly handled.
  • TrieMessage::Ping: elegant, correct, but the sync_channel(0) invariant needs protection at the creation site.
  • DashMap migration: clean, removes the PoisonError code path, measurably correct.
  • Coinbase exemption fix: genuine correctness improvement.
  • Code cache: correct, implicit index invariant worth documenting.

Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

No blocking correctness or security findings from static review. The deferred BAL-validation ordering and the trie-worker idle handshake both look coherent.

Non-blocking perf note:

  1. crates/vm/backends/levm/mod.rs:747 still calls prefetch_accounts for every write account before the new fast-path split at crates/vm/backends/levm/mod.rs:791. That means storage-only and full-info/non-empty accounts still pay the underlying get_account_state lookup, so the new fast paths mostly avoid a later cache hit rather than the actual DB read. If this PR’s goal is BAL-path latency reduction, filtering prefetches down to slow-path candidates would capture more of the intended win.
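
The suggested filtering can be sketched as a classification step that runs before prefetching. This is an illustration under assumptions: `BalAccountChanges`, `UpdatePath`, `classify`, and `prefetch_candidates` are hypothetical, and "full info coverage" is modelled as balance, nonce, and code all being present in the BAL.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum UpdatePath {
    StorageOnly,
    FullInfo,
    Slow,
}

struct BalAccountChanges {
    id: u32,
    balance_changed: bool,
    nonce_changed: bool,
    code_changed: bool,
    post_is_empty: bool,
}

fn classify(account: &BalAccountChanges) -> UpdatePath {
    let changed = [account.balance_changed, account.nonce_changed, account.code_changed];
    if changed.iter().all(|c| !c) {
        UpdatePath::StorageOnly
    } else if changed.iter().all(|c| *c) && !account.post_is_empty {
        UpdatePath::FullInfo
    } else {
        UpdatePath::Slow
    }
}

/// Only slow-path accounts need their pre-state, so only they are prefetched.
fn prefetch_candidates(accounts: &[BalAccountChanges]) -> Vec<u32> {
    accounts
        .iter()
        .filter(|account| classify(account) == UpdatePath::Slow)
        .map(|account| account.id)
        .collect()
}
```

Whether this is safe in the real code depends on whether anything else downstream expects the fast-path accounts to be warm in the cache; that would need checking against mod.rs.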

Testing gap:

  • I couldn’t run the targeted EIP-7928/EIP-8037 test slice here. cargo test is blocked in this environment by read-only ~/.cargo / ~/.rustup state plus an unfetched git dependency (libssz), so this review is source-only.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR is a performance bundle for the BAL parallel-execution path rebased onto bal-devnet-7-pr, delivering ~+26% throughput over the unpatched parallel baseline through eight independent improvements.

  • Overlap fix (blockchain.rs): Stage A now recv() once instead of draining the channel, allowing Stage B (parallel storage roots) to overlap with still-in-flight rayon execution; FxHashMap merge step dropped since BAL guarantees one entry per address.
  • DB contention eliminated (db/mod.rs): RwLock<FxHashMap> caches replaced with DashMap<_, _, FxBuildHasher>, removing ~11% CPU cost from read_contended under 16 rayon workers; prefetch_accounts/prefetch_storage simplified with par_iter + entry.or_insert.
  • Per-tx allocator pressure and BAL validation parallelised (levm/mod.rs): GeneralizedDatabase capacity capped at 32, code objects pre-computed once per BAL code change, per-tx validation moved inside the rayon closure with deferred errors surfaced after the gas-limit check.

Note: change B in the PR description (BAL_PARALLEL_TX_THRESHOLD = 5 adaptive fallback) is described but not present in the diff — worth confirming whether it was intentionally dropped during the rebase.

Confidence Score: 4/5

The parallel execution path changes are logically correct and well-tested; the two minor concerns do not affect production block processing.

Core logic is sound. Main risk is the implicit one-batch invariant in handle_merkleization_bal — a future modification sending a second batch would silently produce a wrong state root. A debug_assert after the Ok arm would fully mitigate this.

crates/blockchain/blockchain.rs (single-recv invariant and error-path Stage C) and crates/vm/backends/levm/mod.rs (deferred BAL validation fast paths)

Important Files Changed

| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Stage A changed from channel drain to single rx.recv(); allows Stage B to overlap with parallel exec. Introduces a fragile one-batch invariant and triggers Stage C (16 trie opens) on the error path. |
| crates/vm/backends/levm/mod.rs | BAL validation moved into rayon closures, code_cache pre-computed, per-tx DB capacity capped, two fast paths added to bal_to_account_updates. Logic correct; deferred-error order preserved. |
| crates/storage/store.rs | Adds TrieMessage enum with Ping variant; wait_for_persistence_idle() correctly exploits sync_channel(0) rendezvous to signal worker idle state. |
| crates/vm/levm/src/db/mod.rs | RwLock replaced with DashMap<_, _, FxBuildHasher> for all three caches; prefetch methods simplified with par_iter + DashMap entry API. |
| cmd/ethrex/cli.rs | Magic-number sleep replaced by wait_for_persistence_idle(); bench-tool only change with no production effect. |
| crates/vm/levm/Cargo.toml | Adds dashmap 6.1 as a direct dep rather than a workspace dep; minor consistency nit. |

Reviews (1): Last reviewed commit: "perf(l1): pre-size backup_storage_slot i..."

Comment on lines +870 to +882
let updates: Vec<AccountUpdate> = match rx.recv() {
    Ok(updates) => {
        let current_length = queue_length.fetch_sub(1, Ordering::Acquire);
        *max_queue_length = current_length.max(*max_queue_length);
        updates
    }
    Err(_) => {
        // Channel closed without a message -- execution failed before
        // bal_to_account_updates ran. Return empty work so the exec
        // error surfaces in execution_result rather than being masked.
        Vec::new()
    }
};

P2 Single-recv invariant has no defensive check

handle_merkleization_bal now consumes exactly one message and then never touches rx again. If execute_block_parallel is ever modified to send a second batch, the extra message will sit in the unbounded channel and be silently dropped — the merkleizer will proceed with only the first batch, producing a wrong state root with no error. A defensive debug_assert!(rx.try_recv().is_err(), "expected exactly one batch from execute_block_parallel") after the Ok arm would catch any accidental protocol change during development.

Comment on lines +876 to +881
Err(_) => {
    // Channel closed without a message -- execution failed before
    // bal_to_account_updates ran. Return empty work so the exec
    // error surfaces in execution_result rather than being masked.
    Vec::new()
}

P2 Err(_) path continues through Stage C (16 trie opens)

When the channel is closed without a message (execution failure before bal_to_account_updates), the function returns Vec::new() and falls through to Stage C, which unconditionally spawns 16 threads to open the parent state trie even though all shards will have no items. Returning an empty AccountUpdatesList early in the Err arm would avoid this overhead without changing the visible behaviour, since the execution error surfaces via execution_result? regardless.
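
The early-return shape proposed here can be sketched as follows. The names are hypothetical: `MerkleOutput` stands in for `AccountUpdatesList`, `merkleize` for `handle_merkleization_bal`, and the 16 trie opens of Stage C are reduced to a counter.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, Default, PartialEq)]
struct MerkleOutput {
    trie_opens: usize,
    updated_accounts: usize,
}

const SHARDS: usize = 16;

fn merkleize(rx: &Receiver<Vec<u32>>) -> MerkleOutput {
    let updates = match rx.recv() {
        Ok(updates) => updates,
        // Channel closed without a message: return empty output right away
        // instead of falling through to the per-shard trie work.
        Err(_) => return MerkleOutput::default(),
    };
    // Stand-in for Stage C: one trie open per shard.
    MerkleOutput {
        trie_opens: SHARDS,
        updated_accounts: updates.len(),
    }
}
```

On the error path no shard work is started, while the caller still observes the execution error through its own result, as the comment notes.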

Comment thread crates/vm/levm/Cargo.toml Outdated
strum = { version = "0.27.1", features = ["derive"] }
rustc-hash.workspace = true
rayon.workspace = true
dashmap = "6.1"

P2 dashmap is added as an inline version requirement rather than routing through the workspace. Nearly every other dep in this file uses workspace = true. A workspace entry ensures the version is bumped in one place and keeps the lockfile diff minimal.

Suggested change
- dashmap = "6.1"
+ dashmap.workspace = true

Base automatically changed from bal-devnet-7-pr to main May 18, 2026 10:17
@edg-l edg-l requested a review from a team as a code owner May 18, 2026 10:17
edg-l added 3 commits May 18, 2026 14:00
Bundle of independent improvements to the BAL parallel-execution path
(execute_block_parallel + handle_merkleization_bal + warm_block_from_bal +
CachingDatabase), validated against a 149-block stress fixture (100M gas,
200-500 tx/block, ~25M-gas median blocks).

The changes (each is independently shippable; combined here for atomic
review since they touch overlapping code):

A. handle_merkleization_bal overlap fix (crates/blockchain/blockchain.rs)
   `for updates in rx { ... }` blocked until channel close (= exec end).
   execute_block_parallel sends exactly one batch up front from
   bal_to_account_updates, so draining nothing useful serialized Stage B
   (parallel storage roots) after exec instead of overlapping with it.
   Replaced with a single rx.recv() and dropped the FxHashMap merge step
   (BAL guarantees one entry per address).

C. import-bench inter-block sleep 500ms -> 100ms (cmd/ethrex/cli.rs)
   Bench tooling change. The sleep gates background trie-layer writeback
   from bleeding into the next block's per-block timer; 100ms is well
   above measured Phase 2 cost on SSD. Cuts bench wall clock 80% without
   affecting the per-block metric. NO effect on production paths.

Q1. Skip prestate read in bal_to_account_updates when BAL covers all info
    fields (crates/vm/backends/levm/mod.rs). Two fast paths added:
    storage-only updates (info: None, removed: false by construction);
    full info coverage with non-empty post (removal impossible, info from
    BAL alone). Slow path keeps existing behavior for partial coverage.

Q2. Per-tx GeneralizedDatabase capacity cap at 32
    (crates/vm/backends/levm/mod.rs::execute_block_parallel). Previously
    sized to bal.accounts().len() (often 100s on stress blocks); p50 tx
    touches <10 accounts. Reduced allocator pressure across rayon workers.

Q3. Memoize code_from_bal results across seed_db_from_bal calls
    (crates/vm/backends/levm/mod.rs). Pre-compute Code objects (hash +
    jump_targets) once per BAL code change before the par_iter; pass cache
    via optional param to seed_db_from_bal. Saves N-1 keccak+jump-target
    scans per code change per block (N = tx count).

Q8. Move per-tx BAL validation into the rayon par_iter closure
    (crates/vm/backends/levm/mod.rs::execute_block_parallel). Eliminates a
    serial post-exec validation pass (~3 ms median across 200 txs). Drops
    current_state and codes inside the closure after validation runs —
    they no longer cross the rayon boundary, reducing per-tx allocator
    pressure. Closure returns deferred Option<EvmError> so gas-limit check
    still takes priority over BAL mismatch errors.

DashMap. CachingDatabase RwLock<HashMap> -> DashMap<_, _, FxBuildHasher>
    (crates/vm/levm/src/db/mod.rs). Found via perf record: 11% of CPU was
    RwLock::read_contended on the single account RwLock with 16 rayon
    workers hammering it. Sharded concurrent map (64 default shards)
    eliminates contention. Sequential paths unaffected (only 2 threads
    access the cache, weren't contended).

Effect on non-BAL paths (block production, pre-Amsterdam, sequential
fallback): DashMap is neutral (low contention); other changes only fire
on the BAL parallel-validation path. No regressions in non-parallel paths.

Q8 in the BAL parallel-path perf bundle (7e081a6) moved per-tx BAL
validation into the rayon closure. As part of the refactor the
`unaccessed_pure_accounts.remove(&header.coinbase)` call was hoisted
out of the per-tx loop to run unconditionally on every parallel-path
invocation.

For 0-tx blocks (empty / withdrawal-only on Amsterdam+) that unconditional
removal silently exempts a BAL entry the protocol calls extraneous: fee
finalization never runs without a tx, so geth's readerTracker never touches
the coinbase either. A BAL coinbase entry on such a block is by
construction extraneous and must surface as a validation error.

Restoring the original gate (only exempt when at least one tx ran)
re-rejects the block. Verified against EELS
test_bal_invalid_extraneous_coinbase[empty_block] and [withdrawal_only].

CallFrameBackup::original_account_storage_slots starts each fresh account's
inner FxHashMap at capacity 0. The first few SSTOREs in any new tx trigger
hashbrown::reserve_rehash 3-4 times in sequence (0 → 4 → 8 → 16).

perf record on a 460-block bal-devnet-7 mainnet-mix fixture (200 tx/block,
~65 Mgas) showed hashbrown::reserve_rehash as the 7th hottest leaf at
3.02B samples. After pre-sizing to 8 the same leaf drops to 2.19B, a 27%
reduction in that frame and ~0.8% of total CPU recovered.

Wall-clock impact is sub-noise on this workload (per-tx CPU savings happen
inside rayon workers; wall-clock is bound by the longest tx per block) but
the CPU savings compound on heavier workloads where critical-path txs hit
the rehash chain.

Wastes ~256 B per untouched account; negligible.
@edg-l edg-l force-pushed the perf/bal-parallel-overhead-rebased branch from 51a768d to 4c4aba3 Compare May 18, 2026 12:00
@edg-l edg-l marked this pull request as draft May 18, 2026 12:01
@ethrex-project-sync ethrex-project-sync Bot moved this from In Review to In Progress in ethrex_l1 May 18, 2026
@edg-l
Contributor Author

edg-l commented May 19, 2026

Closing. Of the original bundle:

@edg-l edg-l closed this May 19, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in ethrex_performance May 19, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to Done in ethrex_l1 May 19, 2026

Labels

L1 (Ethereum client), performance (Block execution throughput and performance in general)

Projects

Status: Done
Status: Done


1 participant